Abstract

The CUR decomposition provides an approximation of a matrix X that has low
reconstruction error and that is sparse in the sense that the resulting approximation
lies in the span of only a few columns of X. In this regard, it appears to be similar
to many sparse PCA methods. However, CUR takes a randomized algorithmic
approach, whereas most sparse PCA methods are framed as convex optimization
problems. In this paper, we try to understand CUR from a sparse optimization
viewpoint. We show that CUR is implicitly optimizing a sparse regression objective
and, furthermore, cannot be directly cast as a sparse PCA method. We also
observe that the sparsity attained by CUR possesses an interesting structure, which
leads us to formulate a sparse PCA method that achieves a CUR-like sparsity.
1 Introduction

CUR decompositions are a recently-popular class of randomized algorithms that approximate a data
matrix $X \in \mathbb{R}^{n \times p}$ by using only a small number of actual columns of $X$ [12, 4].
CUR decompositions are often described as SVD-like low-rank decompositions that have the additional
advantage of being easily interpretable to domain scientists. The motivation to produce a more
interpretable low-rank decomposition is also shared by sparse PCA (SPCA) methods, which are
optimization-based procedures that have been of interest recently in statistics and machine learning.

Although CUR and SPCA methods start with similar motivations, they proceed very differently. For 
example, most CUR methods have been randomized, and they take a purely algorithmic approach. 
By contrast, most SPCA methods start with a combinatorial optimization problem, and they then 
solve a relaxation of this problem. Thus far, it has not been clear to researchers how the CUR and 
SPCA approaches are related. It is the purpose of this paper to understand CUR decompositions 
from a sparse optimization viewpoint, thereby elucidating the connection between CUR decompo- 
sitions and the SPCA class of sparse optimization methods. 

To do so, we begin by putting forth a combinatorial optimization problem (see (6) below) which 
CUR is implicitly approximately optimizing. This formulation will highlight two interesting features 
of CUR: first, CUR attains a distinctive pattern of sparsity, which has practical implications from 
the SPCA viewpoint; and second, CUR is implicitly optimizing a regression-type objective. These 
two observations then lead to the three main contributions of this paper: (a) first, we formulate a 
non-randomized optimization-based version of CUR (see Problem 1: GL-Reg in Section 3) that is 
based on a convex relaxation of the CUR combinatorial optimization problem; (b) second, we show 
that, in contrast to the original PCA-based motivation for CUR, CUR's implicit objective cannot 
be directly expressed in terms of a PCA-type objective (see Theorem 3 in Section 4); and (c) third, 
we propose an SPCA approach (see Problem 2: GL-SPCA in Section 5) that achieves the sparsity 
structure of CUR within the PCA framework. We also provide a brief empirical evaluation of our 
two proposed objectives. While our proposed GL-Reg and GL-SPCA methods are promising in 
and of themselves, our purpose in this paper is not to explore them as alternatives to CUR; instead, 
our goal is to use them to help clarify the connection between CUR and SPCA methods. 



* Jacob Bien and Ya Xu contributed equally. 
We conclude this introduction with some remarks on notation. Given a matrix $A$, we use $A_{(i)}$ to
denote its $i$th row (as a row-vector) and $A^{(i)}$ its $i$th column. Similarly, given a set of indices
$\mathcal{I}$, $A_{\mathcal{I}}$ and $A^{\mathcal{I}}$ denote the submatrices of $A$ containing only these
$\mathcal{I}$ rows and columns, respectively. Finally, we let $\mathcal{L}_{\mathrm{col}}(A)$ denote the
column space of $A$.

2 Background 

In this section, we provide a brief background on CUR and SPCA methods, with a particular
emphasis on topics to which we will return in subsequent sections. Before doing so, recall that, given
an input matrix $X$, Principal Component Analysis (PCA) seeks the $k$-dimensional hyperplane with
the lowest reconstruction error. That is, it computes a $p \times k$ orthogonal matrix $W$ that minimizes

$$\mathrm{ERR}(W) = \|X - XWW^T\|_F. \tag{1}$$

Writing the SVD of $X$ as $U \Sigma V^T$, the minimizer of (1) is given by $V_k$, the first $k$ columns of $V$. In
the data analysis setting, each column of $V$ provides a particular linear combination of the columns
of $X$. These linear combinations are often thought of as latent factors. In many applications,
interpreting such factors is made much easier if they are composed of only a small number of actual
columns of $X$, which is equivalent to only having a small number of nonzero elements.

2.1 CUR matrix decompositions 

CUR decompositions were proposed by Drineas and Mahoney [12, 4] to provide a low-rank
approximation to a data matrix $X$ by using only a small number of actual columns and/or rows of $X$. Fast
randomized variants [3], deterministic variants [5], Nyström-based variants [1, 11], and heuristic
variants [17] have also been considered. Observing that the best rank-$k$ approximation, given by the
truncated SVD, provides the best set of $k$ linear combinations of all the columns, one can ask for the
best set of $k$ actual columns. Most formalizations of "best" lead to intractable combinatorial
optimization problems [12], but one can take advantage of oversampling (choosing slightly more than $k$
columns) and randomness as computational resources to obtain strong quality-of-approximation guarantees.
Theorem 1 (Relative-error CUR [12]). Given an arbitrary matrix $X \in \mathbb{R}^{n \times p}$ and an integer $k$,
there exists a randomized algorithm that chooses a random subset $\mathcal{I} \subset \{1, \dots, p\}$ of size
$c = O(k \log k \log(1/\delta)/\epsilon^2)$ such that $X^{\mathcal{I}}$, the $n \times c$ submatrix containing
those $c$ columns of $X$, satisfies

$$\|X - X^{\mathcal{I}} X^{\mathcal{I}+} X\|_F = \min_{B \in \mathbb{R}^{c \times p}} \|X - X^{\mathcal{I}} B\|_F \le (1 + \epsilon)\|X - X_k\|_F, \tag{2}$$

with probability at least $1 - \delta$, where $X_k$ is the best rank $k$ approximation to $X$.
The algorithm referred to by Theorem 1 is very simple: 

1) Compute the normalized statistical leverage scores, defined below in (3). 

2) Form $\mathcal{I}$ by randomly sampling $c$ columns of $X$, using these normalized statistical leverage scores
as an importance sampling distribution.

3) Return the $n \times c$ matrix $X^{\mathcal{I}}$ consisting of these selected columns.

The key issue here is the choice of the importance sampling distribution. Let the $p \times k$ matrix $V_k$
be the top-$k$ right singular vectors of $X$. Then the normalized statistical leverage scores are

$$\pi_i = \frac{1}{k}\|V_{k(i)}\|_2^2 \tag{3}$$

for all $i = 1, \dots, p$, where $V_{k(i)}$ denotes the $i$-th row of $V_k$. These scores, proportional to the
squared Euclidean norms of the rows of the top-$k$ right singular vectors, define the relevant nonuniformity
structure to be used to identify good (in the sense of Theorem 1) columns. In addition, these scores
are proportional to the diagonal elements of the projection matrix onto the top-$k$ right singular
subspace. Thus, they generalize the so-called hat matrix [8], and they have a natural interpretation
as capturing the "statistical leverage" or "influence" of a given column on the best low-rank fit of
the data matrix [8, 12].
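
The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the reference implementation of [12]: the function names are hypothetical, and the sampling scheme (drawing c indices with replacement and then removing duplicates) is an assumption, since the text does not fix those details.

```python
import numpy as np

def leverage_scores(X, k):
    """Normalized leverage scores pi_i = ||V_k(i)||_2^2 / k, as in (3)."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    Vk = Vt[:k].T                      # p x k matrix of top-k right singular vectors
    return np.sum(Vk ** 2, axis=1) / k

def sample_columns(X, k, c, rng):
    """Draw c column indices from pi (assumed: with replacement, then deduplicated)."""
    pi = leverage_scores(X, k)
    pi = pi / pi.sum()                 # guard against rounding before sampling
    draws = rng.choice(X.shape[1], size=c, replace=True, p=pi)
    return np.unique(draws)

def column_projection_error(X, idx):
    """||X - X^I X^{I+} X||_F for the selected column set I."""
    XI = X[:, idx]
    return np.linalg.norm(X - XI @ np.linalg.pinv(XI) @ X, "fro")

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 20))
pi = leverage_scores(X, k=3)
idx = sample_columns(X, k=3, c=8, rng=rng)
err = column_projection_error(X, idx)
```

Because $V_k$ has orthonormal columns, the scores sum to one and can be used directly as a sampling distribution.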

2.2 Regularized sparse PCA methods 

SPCA methods attempt to make PCA easier to interpret for domain experts by finding sparse
approximations to the columns of $V$.¹ There are several variants of SPCA. For example, Jolliffe et al. [10]
and Witten et al. [19] use the maximum variance interpretation of PCA and provide an optimization
problem which explicitly encourages sparsity in $V$ based on a Lasso constraint [18]. d'Aspremont
et al. [2] take a similar approach, but instead formulate the problem as an SDP.

¹ For SPCA, we only consider sparsity in the right singular vectors $V$ and not in the left singular vectors $U$.
This is similar to considering only the choice of columns and not of both columns and rows in CUR.

Zou et al. [21] use the minimum reconstruction error interpretation of PCA to suggest a different 
approach to the SPCA problem; this formulation will be most relevant to our present purpose. They 
begin by formulating PCA as the solution to a regression-type problem. 

Theorem 2 (Zou et al. [21]). Given an arbitrary matrix $X \in \mathbb{R}^{n \times p}$ and an integer $k$, let $A$ and $W$
be $p \times k$ matrices. Then, for any $\lambda > 0$, let

$$(A^*, V_k^*) = \operatorname{argmin}_{A, W \in \mathbb{R}^{p \times k}} \|X - XWA^T\|_F^2 + \lambda\|W\|_F^2 \quad \text{s.t.} \quad A^T A = I_k. \tag{4}$$

Then, the minimizing matrices $A^*$ and $V_k^*$ satisfy $A^{*(i)} = s_i V^{(i)}$ and
$V_k^{*(i)} = s_i \frac{\Sigma_{ii}^2}{\Sigma_{ii}^2 + \lambda} V^{(i)}$, where $s_i = 1$ or $-1$.

That is, up to signs, $A^*$ consists of the top-$k$ right singular vectors of $X$, and $V_k^*$ consists of
those same vectors "shrunk" by a factor depending on the corresponding singular value. Given this
regression-type characterization of PCA, Zou et al. [21] then "sparsify" the formulation by adding
an $L_1$ penalty on $W$:

$$(A^*, V_k^*) = \operatorname{argmin}_{A, W \in \mathbb{R}^{p \times k}} \|X - XWA^T\|_F^2 + \lambda\|W\|_F^2 + \lambda_1\|W\|_1 \quad \text{s.t.} \quad A^T A = I_k, \tag{5}$$

where $\|W\|_1 = \sum_{ij} |W_{ij}|$. This regularization tends to sparsify $W$ element-wise, so that the
solution $V_k^*$ gives a sparse approximation of $V_k$.
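
The shrinkage described in Theorem 2 can be checked numerically for the W half of the problem. The sketch below is only a partial check under a stated assumption: it fixes A at the top-k right singular vectors (sign choice +1) instead of optimizing over A, so that (4) reduces to a ridge regression in W with a closed-form solution. The variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, k, lam = 40, 8, 3, 2.5
X = rng.standard_normal((n, p))

_, s, Vt = np.linalg.svd(X, full_matrices=False)
Vk = Vt[:k].T                                   # top-k right singular vectors

# With A fixed at V_k (assumption), the minimizer of (4) over W solves
# (X^T X + lam I) W = X^T X A.
G = X.T @ X
W_ridge = np.linalg.solve(G + lam * np.eye(p), G @ Vk)

# Theorem 2 predicts column i of W to be V^(i) scaled by s_i^2 / (s_i^2 + lam).
shrink = s[:k] ** 2 / (s[:k] ** 2 + lam)
W_theory = Vk * shrink
max_gap = np.max(np.abs(W_ridge - W_theory))    # should be at rounding level
```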

3 Expressing CUR as an optimization problem 

In this section, we present an optimization formulation of CUR. Recall, from Section 2.1, that CUR
takes a purely algorithmic approach to the problem of approximating a matrix in terms of a small
number of its columns. That is, it achieves sparsity indirectly by randomly selecting $c$ columns, and
it does so in such a way that the reconstruction error is small with high probability (Theorem 1). By
contrast, SPCA methods are generally formulated as the exact solution to an optimization problem.

From Theorem 1, it is clear that CUR seeks a subset $\mathcal{I}$ of size $c$ for which
$\min_{B \in \mathbb{R}^{c \times p}} \|X - X^{\mathcal{I}} B\|_F$ is small. In this sense, CUR can be viewed as a
randomized algorithm for approximately solving the following combinatorial optimization problem:

$$\min_{\mathcal{I} \subset \{1, \dots, p\}} \; \min_{B \in \mathbb{R}^{c \times p}} \|X - X^{\mathcal{I}} B\|_F \quad \text{s.t.} \quad |\mathcal{I}| \le c. \tag{6}$$

In words, this objective asks for the subset of $c$ columns of $X$ which best describes the entire matrix
$X$. Notice that relaxing $|\mathcal{I}| = c$ to $|\mathcal{I}| \le c$ does not affect the optimum. This optimization problem
is analogous to all-subsets multivariate regression [7], which is known to be NP-hard.
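
To make the combinatorial nature of (6) concrete, the following sketch solves it by exhaustive enumeration. It is only feasible for very small p, which is the point of the NP-hardness remark; the function name `best_column_subset` is hypothetical and not part of the paper.

```python
import itertools
import numpy as np

def best_column_subset(X, c):
    """Exhaustively solve (6) over subsets of exactly c columns (tiny p only)."""
    p = X.shape[1]
    best_err, best_idx = np.inf, None
    for idx in itertools.combinations(range(p), c):
        XI = X[:, list(idx)]
        # Inner minimization over B is an ordinary least-squares problem.
        B, *_ = np.linalg.lstsq(XI, X, rcond=None)
        err = np.linalg.norm(X - XI @ B, "fro")
        if err < best_err:
            best_err, best_idx = err, idx
    return best_idx, best_err

rng = np.random.default_rng(0)
X = rng.standard_normal((15, 6))
errs = [best_column_subset(X, c)[1] for c in range(1, 7)]
```

The optimal error cannot increase as c grows, which is why relaxing the constraint from equality to inequality leaves the optimum unchanged.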

However, by using ideas from the optimization literature we can approximate this combinatorial 
problem as a regularized regression problem that is convex. First, notice that (6) is equivalent to 

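$$\min_{B \in \mathbb{R}^{p \times p}} \|X - XB\|_F \quad \text{s.t.} \quad \sum_{i=1}^{p} 1_{\{\|B_{(i)}\|_2 \neq 0\}} \le c, \tag{7}$$

where we now optimize over a $p \times p$ matrix $B$. To see the equivalence between (6) and (7), note that
the constraint in (7) is the same as finding some subset $\mathcal{I}$ with $|\mathcal{I}| \le c$ such that $B_{\mathcal{I}^c} = 0$.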

The formulation in (7) provides a natural entry point to proposing a convex optimization approach 
corresponding to CUR. First notice that (7) uses an $L_0$ norm on the rows of $B$, which is not convex.
However, we can approximate the $L_0$ constraint by a group lasso penalty, which uses a well-known
convex heuristic proposed by Yuan et al. [20] that encourages prespecified groups of parameters 
to be simultaneously sparse. Thus, the combinatorial problem in (6) can be approximated by the 
following convex (and thus tractable) problem: 

Problem 1 (Group lasso regression: GL-REG). Given an arbitrary matrix $X \in \mathbb{R}^{n \times p}$, let
$B \in \mathbb{R}^{p \times p}$ and $t > 0$. The GL-REG problem is to solve

$$B^* = \operatorname{argmin}_{B} \|X - XB\|_F \quad \text{s.t.} \quad \sum_{i=1}^{p} \|B_{(i)}\|_2 \le t, \tag{8}$$

where $t$ is chosen to get $c$ nonzero rows in $B^*$.

Since the rows of $B$ are grouped together in the penalty $\sum_{i=1}^{p} \|B_{(i)}\|_2$, the row vector $B_{(i)}$ will tend
to be either dense or entirely zero. Note also that the algorithm to solve Problem 1 is a special case
of Algorithm 1 (see below), which solves the GL-SPCA problem, to be introduced later. (Finally,
as a side remark, note that our proposed GL-Reg is strikingly similar to a recently proposed method
for sparse inverse covariance estimation [6, 15].)

4 Distinguishing CUR from SPCA 

Our original intention in casting CUR in the optimization framework was to understand better 
whether CUR could be seen as an SPCA-type method. So far, we have established CUR's con- 
nection to regression by showing that CUR can be thought of as an approximation algorithm for the 
sparse regression problem (7). In this section, we discuss the relationship between regression and 
PCA, and we show that CUR cannot be directly cast as an SPCA method. 

To do this, recall that regression, in particular "self" regression, finds a $B \in \mathbb{R}^{p \times p}$ that minimizes

$$\|X - XB\|_F. \tag{9}$$

On the other hand, PCA-type methods find a set of directions $W$ that minimize

$$\mathrm{ERR}(W) := \|X - XWW^+\|_F. \tag{10}$$

Here, unlike in (1), we do not assume that W is orthogonal, since the minimizer produced from 
SPCA methods is often not required to be orthogonal (recall Section 2.2). 

Clearly, with no constraints on $B$ or $W$, we can trivially achieve zero reconstruction error in both
cases by taking $B = I_p$ and $W$ any $p \times p$ full-rank matrix. However, with additional constraints,
these two problems can be very different. It is common to consider sparsity and/or rank constraints.
We have seen in Section 3 that CUR effectively requires $B$ to be row-sparse; in the standard PCA
setting, $W$ is taken to be rank $k$ (with $k < p$), in which case (10) is minimized by $V_k$ and obtains
the optimal value $\mathrm{ERR}(V_k) = \|X - X_k\|_F$; finally, for SPCA, $W$ is further required to be sparse.

To illustrate the difference between the reconstruction errors (9) and (10) when extra constraints 
are imposed, consider the 2-dimensional toy example in Figure 1. In this example, we compare 
regression with a row-sparsity constraint to PCA with both rank and sparsity constraints. With 
$X \in \mathbb{R}^{n \times 2}$, we plot $X^{(2)}$ against $X^{(1)}$ as the solid points in both plots of Figure 1. Constraining
$B_{(2)} = 0$ (giving row-sparsity, as with CUR methods), (9) becomes $\min_{B_{12}} \|X^{(2)} - X^{(1)} B_{12}\|_2$,
which is a simple linear regression, represented by the black thick line and minimizing the sum 
of squared vertical errors as shown. The red line (left plot) shows the first principal component 
direction, which minimizes ERR(W) among all rank-one matrices W. Here, ERR(W) is the sum 
of squared projection distances (red dotted lines). Finally, if W is further required to be sparse in 
the $X^{(2)}$ direction (as with SPCA methods), we get the rank-one, sparse projection represented by
the green line in Figure 1 (right). The two sets of dotted lines in each plot clearly differ, indicating 
that their corresponding reconstruction errors are different as well. Since we have shown that CUR 
is minimizing a regression-based objective, this toy example suggests that CUR may not in fact be 
optimizing a PCA-type objective such as (10). Next, we will make this intuition more precise. 

The first step to showing that CUR is an SPCA method would be to produce a matrix $V_{CUR}$ for
which $X^{\mathcal{I}} X^{\mathcal{I}+} X = X V_{CUR} V_{CUR}^+$, i.e. to express CUR's approximation in the form of an SPCA
approximation. However, this equality implies
$\mathcal{L}_{\mathrm{col}}(X V_{CUR} V_{CUR}^+) \subseteq \mathcal{L}_{\mathrm{col}}(X^{\mathcal{I}})$, meaning that
$(V_{CUR})_{\mathcal{I}^c} = 0$. If such a $V_{CUR}$ existed, then clearly
$\mathrm{ERR}(V_{CUR}) = \|X - X^{\mathcal{I}} X^{\mathcal{I}+} X\|_F$, and so
CUR could be regarded as implicitly performing sparse PCA in the sense that (a) $V_{CUR}$ is sparse;
and (b) by Theorem 1 (with high probability), $\mathrm{ERR}(V_{CUR}) \le (1 + \epsilon)\mathrm{ERR}(V_k)$. Thus, the existence
of such a $V_{CUR}$ would cast CUR directly as a randomized approximation algorithm for SPCA.
However, the following theorem states that unless an unrealistic constraint on $X$ holds, there does not
exist a matrix $V_{CUR}$ for which $\mathrm{ERR}(V_{CUR}) = \|X - X^{\mathcal{I}} X^{\mathcal{I}+} X\|_F$. The larger implication of this
theorem is that CUR cannot be directly viewed as an SPCA-type method.

Theorem 3. Let $\mathcal{I} \subset \{1, \dots, p\}$ be an index set and suppose $W \in \mathbb{R}^{p \times p}$ satisfies $W_{\mathcal{I}^c} = 0$. Then,

$$\|X - XWW^+\|_F > \|X - X^{\mathcal{I}} X^{\mathcal{I}+} X\|_F,$$

unless $\mathcal{L}_{\mathrm{col}}(X^{\mathcal{I}}) \perp \mathcal{L}_{\mathrm{col}}(X^{\mathcal{I}^c})$, in which case "$\ge$" holds.

Figure 1: Example of the difference in reconstruction errors (9) and (10), when additional constraints
are imposed. Left: regression with row-sparsity constraint (black) compared with PCA with low rank
constraint (red). Right: regression with row-sparsity constraint (black) compared with PCA with 
low rank and sparsity constraint (green). In both plots, the corresponding errors are represented by 
the dotted lines. 



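Proof.

$$
\begin{aligned}
\|X - XWW^+\|_F^2 &= \|X - X^{\mathcal{I}} W_{\mathcal{I}} W^+\|_F^2 = \|X - X^{\mathcal{I}} W_{\mathcal{I}} (W_{\mathcal{I}}^T W_{\mathcal{I}})^{-1} W^T\|_F^2 \\
&= \|X^{\mathcal{I}} - X^{\mathcal{I}} W_{\mathcal{I}} W_{\mathcal{I}}^+\|_F^2 + \|X^{\mathcal{I}^c}\|_F^2 \ge \|X^{\mathcal{I}^c}\|_F^2 \\
&= \|X^{\mathcal{I}^c} - X^{\mathcal{I}} X^{\mathcal{I}+} X^{\mathcal{I}^c}\|_F^2 + \|X^{\mathcal{I}} X^{\mathcal{I}+} X^{\mathcal{I}^c}\|_F^2 \\
&= \|X - X^{\mathcal{I}} X^{\mathcal{I}+} X\|_F^2 + \|X^{\mathcal{I}} X^{\mathcal{I}+} X^{\mathcal{I}^c}\|_F^2 \ge \|X - X^{\mathcal{I}} X^{\mathcal{I}+} X\|_F^2.
\end{aligned}
$$

The last inequality is strict unless $X^{\mathcal{I}} X^{\mathcal{I}+} X^{\mathcal{I}^c} = 0$. □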
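
The inequality of Theorem 3 is easy to probe numerically. The following sketch uses a random X, an arbitrary (hypothetical) index set, and a random W whose rows outside that set are zeroed; it illustrates the statement on one instance and is not a substitute for the proof.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 30, 6
X = rng.standard_normal((n, p))
I = np.array([0, 2, 5])                  # hypothetical index set
Ic = np.setdiff1d(np.arange(p), I)

W = rng.standard_normal((p, p))
W[Ic, :] = 0.0                           # enforce W_{I^c} = 0 (rows outside I vanish)

# PCA-type error (10); rcond is set explicitly because W is rank-deficient.
err_pca = np.linalg.norm(X - X @ W @ np.linalg.pinv(W, rcond=1e-10), "fro")

# Regression-type error attained by CUR with the same column set.
XI = X[:, I]
err_reg = np.linalg.norm(X - XI @ np.linalg.pinv(XI) @ X, "fro")
```

For a generic W of this form, $WW^+$ projects onto the coordinates in the index set, so `err_pca` equals the Frobenius norm of the discarded columns, which is never smaller than `err_reg`.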

5 CUR-type sparsity and the group lasso SPCA 

Although CUR cannot be directly cast as an SPCA-type method, in this section we propose a sparse 
PCA approach (which we call the group lasso SPCA or GL-SPCA) that accomplishes something 
very close to CUR. Our proposal produces a $V^*$ that has rows that are entirely zero, and it is
motivated by the following two observations about CUR. First, following from the definition of the
leverage scores (3), CUR chooses columns of $X$ based on the norm of their corresponding rows of
$V_k$. Thus, it essentially "zeros-out" the rows of $V_k$ with small norms (in a probabilistic sense).
Second, as we have noted in Section 4, if CUR could be expressed as a PCA method, its principal
directions matrix "$V_{CUR}$" would have $p - c$ rows that are entirely zero, corresponding to removing
those columns of $X$.

Recall that Zou et al. [21] obtain a sparse $V^*$ by adding to the optimization problem (4) the
$L_1$ penalty that appears in (5). Since the $L_1$ penalty is on the entire matrix viewed as a vector,
it encourages only unstructured sparsity. To achieve the CUR-type row sparsity, we propose the
following modification of (4):

Problem 2 (Group lasso SPCA: GL-SPCA). Given an arbitrary matrix $X \in \mathbb{R}^{n \times p}$ and an integer
$k$, let $A$ and $W$ be $p \times k$ matrices, and let $\lambda, \lambda_1 > 0$. The GL-SPCA problem is to solve

$$(A^*, V^*) = \operatorname{argmin}_{A, W} \|X - XWA^T\|_F^2 + \lambda\|W\|_F^2 + \lambda_1 \sum_{i=1}^{p} \|W_{(i)}\|_2 \quad \text{s.t.} \quad A^T A = I_k. \tag{11}$$

Thus, the lasso penalty $\lambda_1\|W\|_1$ in (5) is replaced in (11) by a group lasso penalty
$\lambda_1 \sum_{i=1}^{p} \|W_{(i)}\|_2$, where rows of $W$ are grouped together so that each row of $V^*$ will tend to
be either dense or entirely zero.

Importantly, the GL-SPCA problem is not convex in W and A together; it is, however, convex in 
W, and it is easy to solve in A. Thus, analogous to the treatment in Zou et al. [21], we propose 
an iterative alternate-minimization algorithm to solve GL-SPCA. This is described in Algorithm 1; 
and the justification of this algorithm is given in Section 7. Note that if we fix A to be I throughout, 
then Algorithm 1 can be used to solve the GL-Reg problem discussed in Section 3. 
Algorithm 1: Iterative algorithm for solving the GL-SPCA (and GL-Reg) problems.
(For the GL-REG problem, fix A = I throughout this algorithm.)

Input: Data matrix X and initial estimates for A and W

Output: Final estimates for A and W

repeat
    Compute SVD of $X^T X W$ as $U D V^T$ and then $A \leftarrow U V^T$;
    $S \leftarrow \{i : \|W_{(i)}\|_2 \neq 0\}$;
    for $i \in S$ do
        Compute $b_i = \sum_{j \neq i} (X^{(j)T} X^{(i)}) W_{(j)}^T$;
        if $\|A^T X^T X^{(i)} - b_i\|_2 \le \lambda_1/2$ then
            $W_{(i)}^T \leftarrow 0$;
        else
            $W_{(i)}^T \leftarrow \frac{2}{2\|X^{(i)}\|_2^2 + 2\lambda + \lambda_1/\|W_{(i)}\|_2} (A^T X^T X^{(i)} - b_i)$;
until convergence;
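
A compact NumPy rendering of Algorithm 1 is given below as a sketch. Several choices are assumptions not fixed by the text: a fixed number of sweeps stands in for "until convergence", rows are updated in place one after another, and the names `gl_spca`, `gl_spca_objective`, and `fix_A` are hypothetical. Setting `fix_A=True` with `A` equal to the identity corresponds to the GL-Reg variant.

```python
import numpy as np

def gl_spca_objective(X, W, A, lam, lam1):
    """Objective of (11): squared fit + ridge term + group lasso over rows of W."""
    fit = np.linalg.norm(X - X @ W @ A.T, "fro") ** 2
    ridge = lam * np.linalg.norm(W, "fro") ** 2
    group = lam1 * np.linalg.norm(W, axis=1).sum()
    return fit + ridge + group

def gl_spca(X, W, A, lam, lam1, n_iter=50, fix_A=False):
    """Alternating minimization following Algorithm 1 (n_iter sweeps, assumed)."""
    G = X.T @ X                                  # Gram matrix, G[j, i] = X^(j)T X^(i)
    W, A = W.copy(), A.copy()
    for _ in range(n_iter):
        if not fix_A:
            U, _, Vt = np.linalg.svd(G @ W, full_matrices=False)
            A = U @ Vt                           # A <- U V^T
        active = np.flatnonzero(np.linalg.norm(W, axis=1) > 0)
        for i in active:
            b = W.T @ G[:, i] - G[i, i] * W[i]   # b_i: sum over j != i
            r = A.T @ G[:, i] - b                # A^T X^T X^(i) - b_i
            if np.linalg.norm(r) <= lam1 / 2:
                W[i] = 0.0                       # row is zeroed out
            else:
                denom = 2 * G[i, i] + 2 * lam + lam1 / np.linalg.norm(W[i])
                W[i] = 2 * r / denom             # uses the current row norm
    return A, W

rng = np.random.default_rng(3)
X = rng.standard_normal((30, 12))
k, lam, lam1 = 3, 1.0, 20.0
_, _, Vt = np.linalg.svd(X, full_matrices=False)
A0 = W0 = Vt[:k].T                               # initialize at the PCA solution
A, W = gl_spca(X, W0, A0, lam, lam1)
```

Note that a row set to zero leaves the active set and is never revisited, exactly as the definition of S in the algorithm implies.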



We remark that such row-sparsity in V* can have either advantages or disadvantages. Consider, for 
example, when there are a small number of informative columns in X and the rest are not important 
for the task at hand [12, 14]. In such a case, we would expect that enforcing entire rows to be zero 
would lead to better identification of the signal columns; and this has been empirically observed in 
the application of CUR to DNA SNP analysis [14]. The unstructured V*, by contrast, would not 
be able to "borrow strength" across all columns of V* to differentiate the signal columns from the 
noise columns. On the other hand, requiring such structured sparsity is more restrictive and may 
not be desirable. For example, in microarray analysis in which we have measured p genes on n 
patients, our goal may be to find several underlying factors. Biologists have identified "pathways" 
of interconnected genes [16], and it would be desirable if each sparse factor could be identified with 
a different pathway (that is, a different set of genes). Requiring all factors of V* to exclude the same 
p — c genes does not allow a different sparse subset of genes to be active in each factor. 

We finish this section by pointing out that while most SPCA methods only enforce unstructured 
zeros in V*, the idea of having a structured sparsity in the PCA context has very recently been 
explored [9]. Our GL-SPCA problem falls within the broad framework of this idea. 

6 Empirical Comparisons 

In this section, we evaluate the performance of the four methods discussed above on both syn- 
thetic and real data. In particular, we compare the randomized CUR algorithm of Mahoney and 
Drineas [12, 4] to our GL-Reg (of Problem 1), and we compare the SPCA algorithm proposed 
by Zou et al. [21] to our GL-SPCA (of Problem 2). We have also compared against the SPCA 
algorithm of Witten et al. [19], and we found the results to be very similar to those of Zou et al. 

6.1 Simulations 

We first consider synthetic examples of the form $X = \hat{X} + E$, where $\hat{X}$ is the underlying signal
matrix and $E$ is a matrix of noise. In all our simulations, $E$ has i.i.d. $\mathcal{N}(0, 1)$ entries, while the
signal $\hat{X}$ has one of the following forms:

Case I) $\hat{X} = [0_{n \times (p - c)}; X^*]$ where the $n \times c$ matrix $X^*$ is the nonzero part of $\hat{X}$. In other words,
$\hat{X}$ has $c$ nonzero columns and does not necessarily have a low-rank structure.

Case II) $\hat{X} = UV^T$ where $U$ and $V$ each consist of $k < p$ orthogonal columns. In addition to
being low-rank, $V$ has entire rows equal to zero (i.e. it is row-sparse).

Case III) $\hat{X} = UV^T$ where $U$ and $V$ each consist of $k < p$ orthogonal columns. Here $V$ is
low-rank and sparse, but the sparsity is not structured (i.e. it is scattered-sparse).

A successful method attains low reconstruction error of the true signal $\hat{X}$ and has high precision in
identifying correctly the zeros in the underlying model. As previously discussed, the four methods
optimize for different types of reconstruction error. Thus, in comparing CUR and GL-Reg, we
use the regression-type reconstruction error
$\mathrm{ERR}_{\mathrm{reg}}(\mathcal{I}) = \|\hat{X} - X^{\mathcal{I}} X^{\mathcal{I}+} X\|_F$, whereas for the
comparison of SPCA and GL-SPCA, we use the PCA-type error $\mathrm{ERR}(V) = \|\hat{X} - XVV^+\|_F$.
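
The two error measures can be computed as follows. This is a sketch with hypothetical function names; the `target` argument is an assumption introduced here so that the same code covers both the simulation setting (where the reference is the true signal) and the setting of Section 6.2 (where the reference is the observed X itself).

```python
import numpy as np

def err_reg(target, X, idx):
    """Regression-type error: ||target - X^I X^{I+} X||_F for column set idx."""
    XI = X[:, idx]
    return np.linalg.norm(target - XI @ np.linalg.pinv(XI) @ X, "fro")

def err_pca(target, X, V):
    """PCA-type error: ||target - X V V^+||_F for a p x k directions matrix V."""
    return np.linalg.norm(target - X @ V @ np.linalg.pinv(V), "fro")

rng = np.random.default_rng(4)
X = rng.standard_normal((20, 6))
_, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
e_pca = err_pca(X, X, Vt[:k].T)          # PCA directions, reference = X
e_reg_all = err_reg(X, X, np.arange(6))  # using every column reproduces X
```

With the observed matrix as reference, the PCA directions recover the familiar tail of singular values, and selecting all columns gives zero regression-type error.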

Table 1 presents the simulation results from the three cases. All comparisons use $n = 100$ and
$p = 1000$. In Cases II and III, the signal matrix has rank $k = 10$. The underlying sparsity level is
20%, i.e. 80% of the entries of $\hat{X}$ (Case I) and $V$ (Cases II and III) are zeros. Note that all methods
except for GL-Reg require the rank $k$ as an input, and we always take it to be 10 even in Case I. For
easy comparison, we have tuned each method to have the correct total number of zeros. The results
are averaged over 5 trials.



Methods                 Case I            Case II           Case III
ERR_reg(I)   CUR        316.29 (0.835)    315.28 (0.797)    315.64 (0.166)
             GL-REG     316.29 (0.989)    315.28 (0.750)    315.64 (0.107)
ERR(V)       SPCA       177.92 (0.809)    44.388 (0.799)    44.995 (0.792)
             GL-SPCA    141.85 (0.998)    37.310 (0.767)    45.500 (0.804)

Table 1: Simulation results: The reconstruction errors and the percentages of correctly identified 
zeros (in parentheses). 

We notice in Table 1 that the two regression-type methods CUR and GL-REG have very similar 
performance. As we would expect, since CUR only uses information in the top k singular vectors, it 
does slightly worse than GL-Reg in terms of precision when the underlying signal is not low-rank 
(Case I). In addition, both methods perform poorly if the sparsity is not structured as in Case III. The 
two PCA-type methods perform similarly as well. Again, the group lasso method seems to work 
better in Case I. We note that the precisions reported here are based on element-wise sparsity; if we
were measuring row-sparsity, methods like SPCA would perform poorly since they do not encourage 
entire rows to be zero. 

6.2 Microarray example 

We next consider a microarray dataset of soft tissue tumors studied by Nielsen et al. [13].
Mahoney and Drineas [12] apply CUR to this dataset of $n = 31$ tissue samples and $p = 5520$ genes.
As with the simulation results, we use two sets of comparisons: we compare CUR with GL-Reg,
and we compare SPCA with GL-SPCA. Since we do not observe the underlying truth $\hat{X}$, we take
$\mathrm{ERR}_{\mathrm{reg}}(\mathcal{I}) = \|X - X^{\mathcal{I}} X^{\mathcal{I}+} X\|_F$ and $\mathrm{ERR}(V) = \|X - XVV^+\|_F$. Also, since we do not observe
the true sparsity, we cannot measure the precision as we do in Table 1. The left plot in Figure 2
shows $\mathrm{ERR}_{\mathrm{reg}}(\mathcal{I})$ as a function of $|\mathcal{I}|$. We see that CUR and GL-REG perform similarly. (However,
since CUR is a randomized algorithm, on every run it gives a different result. From a practical 
standpoint, this feature of CUR can be disconcerting to biologists wanting to report a single set of 
important genes. In this light, GL-Reg may be thought of as an attractive non-randomized alterna- 
tive to CUR.) The right plot of Figure 2 compares GL-SPCA to SPCA (specifically, Zou et al. [21]). 
Since SPCA does not explicitly enforce row-sparsity, for a gene to be not used in the model requires 
all of the (k = 4) columns of V* to exclude it. This likely explains the advantage of GL-SPCA over 
SPCA seen in the figure. 

7 Justification of Algorithm 1 

The algorithm alternates between minimizing with respect to $A$ and $B$ until convergence. (In this
section, $B$ denotes the matrix written $W$ in (11) and in Algorithm 1.)

Solving for A given B: If $B$ is fixed, then the regularization penalty in (11) can be ignored, in
which case the optimization problem becomes $\min_A \|X - XBA^T\|_F^2$ subject to $A^T A = I$. This
problem was considered by Zou et al. [21], who showed that the solution is obtained by computing
the SVD of $(X^T X)B$ as $(X^T X)B = UDV^T$ and then setting $A = UV^T$. This explains step 1 in
Algorithm 1.

Solving for B given A: If $A$ is fixed, then (11) becomes an unconstrained convex optimization
problem in $B$. The subgradient equations (using that $A^T A = I_k$) are

$$2B^T X^T X^{(i)} - 2A^T X^T X^{(i)} + 2\lambda B_{(i)}^T + \lambda_1 s_i = 0; \quad i = 1, \dots, p, \tag{12}$$

[Figure 2 omitted: two panels, each titled "Microarray Dataset", with horizontal axis "Number of genes used".]

Figure 2: Left: Comparison of CUR, multiple runs, with GL-Reg; Right: Comparison of GL-SPCA
with SPCA (specifically, Zou et al. [21]).

where the subgradient vectors $s_i = B_{(i)}^T/\|B_{(i)}\|_2$ if $B_{(i)} \neq 0$, or $\|s_i\|_2 \le 1$ if $B_{(i)} = 0$. Let us
define $b_i = \sum_{j \neq i} (X^{(j)T} X^{(i)}) B_{(j)}^T = B^T X^T X^{(i)} - \|X^{(i)}\|_2^2 B_{(i)}^T$, so that the subgradient equations
can be written as

$$b_i + (\|X^{(i)}\|_2^2 + \lambda) B_{(i)}^T - A^T X^T X^{(i)} + (\lambda_1/2) s_i = 0. \tag{13}$$

The following claim explains Step 3 in Algorithm 1.

Claim 1. $B_{(i)} = 0$ if and only if $\|A^T X^T X^{(i)} - b_i\|_2 \le \lambda_1/2$.

Proof. First, if $B_{(i)} = 0$, the subgradient equations (13) become $b_i - A^T X^T X^{(i)} + (\lambda_1/2) s_i = 0$.
Since $\|s_i\|_2 \le 1$ if $B_{(i)} = 0$, we have $\|A^T X^T X^{(i)} - b_i\|_2 \le \lambda_1/2$. To prove the other
direction, recall that $B_{(i)} \neq 0$ implies $s_i = B_{(i)}^T/\|B_{(i)}\|_2$. Substituting this expression into
(13), rearranging terms, and taking the norm on both sides, we get
$2\|A^T X^T X^{(i)} - b_i\|_2 = (2\|X^{(i)}\|_2^2 + 2\lambda + \lambda_1/\|B_{(i)}\|_2)\|B_{(i)}\|_2 > \lambda_1$. □

By Claim 1, $\|A^T X^T X^{(i)} - b_i\|_2 > \lambda_1/2$ implies that $B_{(i)} \neq 0$, which further implies
$s_i = B_{(i)}^T/\|B_{(i)}\|_2$. Substituting into (13) gives Step 4 in Algorithm 1.
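
For completeness, the substitution can be written out as follows. This is a worked restatement that uses only (13) and the expression for $s_i$ in the nonzero case; it is not an additional result.

```latex
\begin{align*}
0 &= b_i + \bigl(\|X^{(i)}\|_2^2 + \lambda\bigr) B_{(i)}^T - A^T X^T X^{(i)}
     + \frac{\lambda_1}{2}\,\frac{B_{(i)}^T}{\|B_{(i)}\|_2} \\
\Longrightarrow\quad
\Bigl(\|X^{(i)}\|_2^2 + \lambda + \frac{\lambda_1}{2\|B_{(i)}\|_2}\Bigr) B_{(i)}^T
  &= A^T X^T X^{(i)} - b_i \\
\Longrightarrow\quad
B_{(i)}^T &= \frac{2}{2\|X^{(i)}\|_2^2 + 2\lambda + \lambda_1/\|B_{(i)}\|_2}
  \bigl(A^T X^T X^{(i)} - b_i\bigr).
\end{align*}
```

The right-hand side still involves $\|B_{(i)}\|_2$, so the relation is a fixed-point equation; a natural reading, assumed here, is that the update in Algorithm 1 evaluates this norm at the current iterate.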

8 Conclusion 

In this paper, we have elucidated several connections between two recently-popular matrix decom- 
position methods that adopt very different perspectives on obtaining interpretable low-rank matrix 
decompositions. In doing so, we have suggested two optimization problems, GL-Reg and GL- 
SPCA, that highlight similarities and differences between the two methods. In general, SPCA 
methods obtain interpretability by modifying an existing intractable objective with a convex regu- 
larization term that encourages sparsity, and then exactly optimizing that modified objective. On 
the other hand, CUR methods operate by using randomness and approximation as computational re- 
sources to optimize approximately an intractable objective, thereby implicitly incorporating a form 
of regularization into the steps of the approximation algorithm. Understanding this concept of im- 
plicit regularization via approximate computation is clearly of interest more generally, in particular 
for applications where the size scale of the data is expected to increase. 